Current Issue: July-September | Volume: 2025 | Issue Number: 3 | Articles: 5
As mobile device use continues to grow rapidly, online video streaming has risen to the top of the entertainment industry. These platforms have expanded dramatically by incorporating Big Data Analytics and Artificial Intelligence, which are critical to improving the user interface, platform performance, and the personalization of recommended content. This paper examines how Big Data Analytics makes it possible to collect large amounts of data about users: what they watch, what they like, and how they behave. AI then utilizes this data so that customers receive more suitable material and better recommendations, and so that content can be delivered more efficiently. The study also highlights the importance of these technologies in promoting business growth, sustaining user engagement, and maintaining competitiveness in the online video streaming market, with examples of their effective application. This work presents a comprehensive investigation of the combined role of Big Data and AI and reports the findings needed to judge their efficacy as success factors for existing and future video streaming services.
Live-streaming platforms such as TikTok have recently experienced exponential growth, attracting millions of daily viewers. This surge in network traffic often results in increased latency, even on resource-rich nodes during peak times, degrading users' Quality of Experience (QoE). This study aims to predict QoE-downgrade events by leveraging cross-layer device data through real-time prediction and monitoring. We propose a Real-time Multi-level Transformer (RMT) model that predicts the QoE of live streaming by integrating time-series data from multiple network layers. Unlike existing approaches, which primarily assess the immediate impact of network conditions on video quality, our method introduces a device-mask pretraining (DMP) technique that pretrains on cross-layer device data to capture the correlations among devices, thereby improving the accuracy of QoE predictions. To facilitate training of the RMT, we further built a Live Stream Quality of Experience (LSQE) dataset of 5,000,000 records collected from over 300,000 users over a 7-day period. By analyzing the temporal evolution of network conditions in real time, the RMT model provides more accurate predictions of user experience. The experimental results demonstrate that the proposed pretraining task significantly enhances the model's prediction accuracy and that the overall method outperforms baseline approaches.
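The device-mask pretraining idea lends itself to a compact illustration: mask the feature channels of randomly chosen device slots in a cross-layer time series and train a transformer encoder to reconstruct them, forcing it to learn cross-device correlations. The following is a minimal PyTorch sketch under our own assumptions; the names (RMTEncoder, dmp_step), tensor shapes, and mask ratio are hypothetical, not the authors' code.

```python
# Hypothetical sketch of device-mask pretraining (DMP): randomly mask some
# device slots in a flattened cross-layer time series and reconstruct them.
# Shapes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class RMTEncoder(nn.Module):
    def __init__(self, d_feat=16, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(d_feat, d_model)            # per-slot feature embedding
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.recon = nn.Linear(d_model, d_feat)           # reconstruction head for DMP

    def forward(self, x, device_mask=None):
        # x: (batch, time * n_devices, d_feat) cross-layer time series, flattened
        h = self.proj(x)
        if device_mask is not None:                       # (batch, time * n_devices) bool
            h[device_mask] = self.mask_token              # replace masked slots
        return self.recon(self.encoder(h))

def dmp_step(model, x, mask_ratio=0.3):
    """One pretraining step: reconstruct the masked device features."""
    mask = torch.rand(x.shape[:2]) < mask_ratio
    pred = model(x, device_mask=mask)
    return ((pred - x) ** 2)[mask].mean()                 # MSE only on masked slots

model = RMTEncoder()
x = torch.randn(4, 10 * 8, 16)    # 4 samples, 10 time steps, 8 devices, 16 features
loss = dmp_step(model, x)
loss.backward()
```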
We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA) by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.
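To make the offload-and-retrieve mechanism concrete, here is a minimal sketch assuming chunk-level KV-Caches kept on CPU and a mean-pooled key as each chunk's retrieval embedding; the KVStore class and that embedding choice are our simplifications, not the ReKV implementation.

```python
# Minimal sketch: offload per-chunk KV caches to CPU, reload only the chunks
# most relevant to a question. The mean-pooled key as a chunk descriptor is
# an assumption made for illustration.
import torch

class KVStore:
    def __init__(self):
        self.kv = []     # list of (keys, values) tensors kept on CPU "RAM/disk"
        self.emb = []    # one retrieval embedding per chunk

    def add_chunk(self, keys, values):
        self.kv.append((keys.cpu(), values.cpu()))        # offload after encoding
        self.emb.append(keys.mean(dim=0).cpu())           # crude chunk descriptor

    def retrieve(self, query_emb, k=2, device="cpu"):
        """Reload only the k chunks whose embeddings best match the query."""
        sims = torch.stack(self.emb) @ query_emb.cpu()
        top = sims.topk(min(k, len(self.emb))).indices
        return [(self.kv[i][0].to(device), self.kv[i][1].to(device)) for i in top]

store = KVStore()
for _ in range(10):                                       # streaming encoding pass
    store.add_chunk(torch.randn(16, 64), torch.randn(16, 64))

query = torch.randn(64)                                   # embedding of the question
relevant = store.retrieve(query, k=2)                     # only these go back to GPU
print(len(relevant), relevant[0][0].shape)
```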
With the rise of real-world human-AI interaction applications, such as AI assistants, the need for Streaming Video Dialogue is critical. To address this need, we introduce STREAMMIND, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. To resolve the key contradiction between linear video streaming speed and quadratic transformer computation cost, we propose a novel perception-cognition interleaving paradigm named "event-gated LLM invocation", in contrast to the existing per-time-step LLM invocation. By introducing a Cognition Gate network between the video encoder and the LLM, the LLM is invoked only when relevant events occur. To realize event feature extraction at constant cost, we propose an Event-Preserving Feature Extractor (EPFE) based on a state-space method, generating a single perception token for spatiotemporal features. These techniques give the video LLM full-FPS perception and real-time cognition responses. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency, paving the way for ultra-high-FPS applications such as Game AI and interactive media. The code and data are available at https://aka.ms/StreamMind.
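The event-gated invocation paradigm can be illustrated with a toy sketch: a small gate network scores each perception token, and the expensive LLM call happens only when the score crosses a threshold. The CognitionGate architecture, the threshold value, and the stubbed-out LLM call below are all assumptions for illustration, not the paper's implementation.

```python
# Toy sketch of event-gated LLM invocation: the gate decides per frame
# whether the (expensive) LLM should be invoked. The real EPFE and LLM
# are stubbed out; this only illustrates the gating control flow.
import torch
import torch.nn as nn

class CognitionGate(nn.Module):
    """Scores a perception token; high score = a relevant event occurred."""
    def __init__(self, d=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, token):
        return torch.sigmoid(self.score(token))

def run_stream(tokens, gate, threshold=0.8):
    responses = []
    for t, token in enumerate(tokens):           # tokens from a stand-in for EPFE
        if gate(token).item() > threshold:       # invoke LLM only on salient events
            responses.append((t, "LLM invoked"))  # placeholder for real decoding
    return responses

gate = CognitionGate()
stream = torch.randn(100, 64)                    # 100 per-frame perception tokens
print(run_stream(stream, gate))
```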
The existence of redundant video frames results in a substantial waste of computational resources during video-understanding tasks. Frame sampling is a crucial technique for improving resource utilization. However, existing sampling strategies typically adopt fixed-frame selection, which lacks flexibility in handling different action categories. In this paper, inspired by the neural mechanism of the human visual pathway, we propose an effective and interpretable frame-sampling method called Entropy-Guided Motion Enhancement Sampling (EGMESampler), which removes redundant spatio-temporal information from videos. Our fundamental motivation is that motion information is an important signal for adaptively selecting frames from videos. Thus, EGMESampler first performs motion modeling to extract motion information from irrelevant backgrounds. We then design an entropy-based dynamic sampling strategy built on this motion information to ensure that the sampled frames cover the important information in a video. Finally, we apply attention operations to the motion information and sampled frames to enhance the motion expression of the sampled frames and remove redundant spatial background information. EGMESampler can be embedded in existing video-processing algorithms, and experiments on five benchmark datasets demonstrate its effectiveness compared with previous fixed-sampling strategies, as well as its generalizability across different video models and datasets.
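As a rough illustration of entropy-guided, motion-driven sampling, the sketch below uses frame-difference energy as a stand-in for the paper's motion model and lets the entropy of the resulting distribution decide how many frames to keep; the egme_sample function and its parameters are hypothetical, mirroring the idea rather than the exact algorithm.

```python
# Rough sketch under stated assumptions: frame-difference energy approximates
# motion, sampling probability follows motion, and the distribution's entropy
# (motion spread out vs. concentrated) sets the number of frames kept.
import numpy as np

def egme_sample(video, min_k=4, max_k=16):
    # video: (T, H, W) grayscale frames
    motion = np.abs(np.diff(video, axis=0)).mean(axis=(1, 2))  # per-step motion
    p = motion / motion.sum()                                  # sampling distribution
    entropy = -(p * np.log(p + 1e-12)).sum()
    max_entropy = np.log(len(p))
    # high entropy = motion spread across the clip -> keep more frames
    k = int(min_k + (max_k - min_k) * entropy / max_entropy)
    idx = np.sort(np.random.choice(len(p), size=min(k, len(p)),
                                   replace=False, p=p))
    return idx + 1                                             # offset from diff()

video = np.random.rand(60, 32, 32)                             # dummy 60-frame clip
print(egme_sample(video))
```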